Abstract

Background: The evaluation of associations between genotypes and diseases in a case-control framework plays an 
important role in genetic epidemiology. This paper focuses on the evaluation of the homogeneity of both genotypic 
and allelic frequencies. The traditional test that is used to check allelic homogeneity is known to be valid only under 
Hardy-Weinberg equilibrium, a property that may not hold in practice. 

Results: We first describe the flaws of the traditional (chi-squared) tests for both allelic and genotypic homogeneity. 
Besides the known problem of the allelic procedure, we show that whenever these tests are used, an incoherence 
may arise: sometimes the genotypic homogeneity hypothesis is not rejected, but the allelic hypothesis is. As we 
argue, this is logically impossible. Some methods that were recently proposed implicitly rely on the idea that this does 
not happen. In an attempt to correct this incoherence, we describe an alternative frequentist approach that is
appropriate even when Hardy-Weinberg equilibrium does not hold. It is then shown that the problem remains and is
intrinsic to frequentist procedures. Finally, we introduce the Full Bayesian Significance Test to test both hypotheses and
prove that the incoherence cannot happen with these new tests. To illustrate this, all five tests are applied to real and
simulated datasets. Using power analysis, we show that the Bayesian method is comparable to the
frequentist one and has the advantage of being coherent.

Conclusions: Contrary to more traditional approaches, the Full Bayesian Significance Test for association studies 
provides a simple, coherent and powerful tool for detecting associations. 

Keywords: Allelic homogeneity test, Bayesian methods, Chi-squared test, Hardy-Weinberg equilibrium, FBST, 
Monotonicity 



Background 

One of the main goals in genetic epidemiology is the 
evaluation of associations between specific genotypes or 
alleles and a certain disease. Association studies are usu- 
ally performed in a case-control framework in which one 
or several polymorphisms of candidate genes are eval- 
uated in a group of cases (that is, patients that have a 
disease) and in a group of controls from the same popu- 
lation (that is, healthy individuals) [1]. The frequencies of 
each of the genotypes are then computed so that statis- 
tical tests that aim at checking for associations between 
genes and the disease can be performed. The population
studied usually must be homogeneous regarding ethnicity,
gender distribution and other factors that may bias the
results and produce false-positive associations. See [2] for
a nontechnical summary of reasons that may lead to false
discoveries in case-control studies and [3] for a theoretical
analysis of the consequences of population stratification.
For more on case-control studies, the reader is
referred to [4].

Several statistical tests are usually employed for this
scenario. Among them are the Cochran-Armitage test for trends [5],
homogeneity chi-square tests for contingency tables of
both genotypic and allelic frequencies [6], likelihood ratio
tests and Wald tests [7]. See, for example, [8] and [9] for
a summary of these tests. Some of these statistics are
specifically designed to work under assumptions such as
dominance models, recessive models or Hardy-Weinberg
Equilibrium (HWE). However, great importance is now given
to new methods that are robust to model misspecification,
mainly because power is usually small when the model is
wrong and type-1 error rates are usually incorrect
(see e.g. [7,10,11]).

HWE plays an important role in genetic studies, in
particular when testing for allelic homogeneity [12]. The main
reason is that the traditional test for allelic homogeneity
fails when HWE does not hold, a point to which we will
return later. In words, HWE is a constraint on the genotypic
proportions that implies, under some assumptions,
stability of the different genotypes over the generations of
the population (see e.g. [13] and [14]). These assumptions
include, for example, random mating between individuals.
For many diseases, random mating is not expected to
be satisfied. The same holds for other conditions required
for HWE, which may be unrealistic in practice.
In fact, as stated by [15], "a population will
never be exactly in HWE". Hence, the need to design tests
that are robust to departures from HWE is evident. A
common practice in such problems is to first test HWE,
discarding genes that are not in equilibrium. This is done
in an attempt to identify genotyping errors. Such an
approach should be avoided, as discussed by [12]. The
main reason is that the two-step procedure alters type-1
error rates. The authors of [12] also emphasize that the
correct way to deal with this problem is to inherently
account for deviations from HWE with adjusted tests,
the approach we take here.

In the present paper, we focus on two hypotheses: 1.
homogeneity of the genotypic frequencies; and 2. homogeneity
of the allelic frequencies. Usually, data in such
studies are summarized in two different ways [9]. The
first one consists of a table with the genotypic frequencies
of case and control groups. The second, a table with
the allelic frequencies. Tables 1 and 2 illustrate this
representation using data presented in [16], which were also
considered by [12]. Their study was designed to test the
hypothesis that GABA_A subunit genes would contribute to a
disorder due to methamphetamine use. It is worth noting
that Table 2 contains twice as many observations as
Table 1. [9] discusses in detail the problem of doubling
the sample size. In particular, it is shown that methods that
"treat alleles as individual entities" [9] have wrong nominal
type-1 errors when HWE does not hold. We must
recall that the power of a test can increase considerably
by increasing the sample size, a nominal increase that can
be misleading when it is not reasonable to "treat alleles
as individual entities". This issue will be discussed in
further detail later.

Table 1 Genotypic frequencies

Group      AA    AB    BB    Total
Case       55    83    50    188
Control    24    42    39    105

Genotypic frequencies for the data set presented in [16].

Table 2 Allelic frequencies

Group      A     B     Total
Case       193   183   376
Control    90    120   210

Allelic frequencies derived from Table 1.
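As a small illustration of how Table 2 follows from Table 1, the sketch below counts alleles from genotype counts (each AA individual carries two A alleles, each AB one A and one B, each BB two B alleles). It assumes NumPy is available; the function name `allele_table` is hypothetical and not part of the original study.

```python
import numpy as np


def allele_table(genotype_table):
    """Convert genotype counts (columns AA, AB, BB) into allele counts (columns A, B)."""
    g = np.asarray(genotype_table)
    a = 2 * g[:, 0] + g[:, 1]  # A alleles: two per AA individual, one per AB individual
    b = 2 * g[:, 2] + g[:, 1]  # B alleles: two per BB individual, one per AB individual
    return np.column_stack([a, b])


# Rows: case, control (Table 1)
table1 = np.array([[55, 83, 50],
                   [24, 42, 39]])
table2 = allele_table(table1)
print(table2)              # case (193, 183), control (90, 120), as in Table 2
print(table2.sum(axis=1))  # (376, 210): twice the number of individuals per group
```

The row totals make the doubling of the sample size explicit: 188 cases yield 376 alleles and 105 controls yield 210.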

The aims of this paper are four-fold: 1 - to describe 
how the analysis of such data is usually conducted and 
to emphasize its known flaw (namely lack of robustness 
to departures from HWE); 2 - to describe one exact fre- 
quentist approach which is correct from a classical point 
of view; 3 - to present a Bayesian method to deal with 
the problem, and 4 - to advocate the use of the Bayesian 
solution by demonstrating why this is the best solution 
compared to the others. The main argument is based 
on an undesirable logical inconsistency that can happen 
whenever p-values are used to test nested hypotheses. We 
prove that this does not happen when using the Bayesian 
method proposed. We also show that the Bayesian and 
the correct frequentist solutions have comparable power. 
Simulations and analyses of real data are shown in order 
to illustrate the problem. 

The paper is organized as follows. Section Methods contains
three subsections: Usual Procedures, which introduces
the notation that is used throughout the paper,
discusses the usual methods to deal with the problem and
argues why the test for allelic homogeneity is wrong when
there are departures from HWE; A Different Frequentist
Test, which introduces a frequentist test that works even
when departures from HWE happen; and Bayesian Solution,
which introduces the FBST approach to solve the problem.
Section Results and Discussion first focuses on the
issue of the logical incoherence that happens when using
the frequentist procedures discussed in the paper and also
shows that the same does not happen with the Bayesian
method FBST. A brief discussion on Bayes factors is also
provided. Finally, we address the question of whether the
Bayesian method has good frequentist properties. Section
Conclusions summarizes the findings of the paper.

Methods 

Here, we formally describe three different approaches to 
deal with the problem described: the usual procedure, a 
correct frequentist proposal and a Bayesian solution. 

Usual Procedures 

We begin by describing the statistical model that is used to 
deal with the problem approached in this paper (namely, 
product of multinomials) and also how the hypotheses of 
interest are usually tested in genetic literature. For more 
details, see [9]. 
Let $G = \{AA, AB, BB\}$ be the set of all possible genotypes
for the locus of interest. As in Table 3, denote by
$X = (X_{AA}, X_{AB}, X_{BB})$ and $Y = (Y_{AA}, Y_{AB}, Y_{BB})$ the random
vectors with the genotypic frequencies from the
case and control group samples, with $\sum_{i \in G} X_i = n$ and
$\sum_{i \in G} Y_i = m$ being the total of individuals observed in
each group. Also, let $\gamma = (\gamma_{AA}, \gamma_{AB}, \gamma_{BB})$, where $\gamma_i$ is
the probability that an individual from the case group has
genotype $i$, and $\pi = (\pi_{AA}, \pi_{AB}, \pi_{BB})$, where $\pi_i$ is the
probability that an individual from the control group has
genotype $i$, $i \in G$. The parametric space is

$$\Theta = \Big\{ (\gamma_{AA}, \gamma_{AB}, \gamma_{BB}, \pi_{AA}, \pi_{AB}, \pi_{BB}) \in \mathbb{R}_+^6 : \sum_{i \in G} \gamma_i = 1, \ \sum_{i \in G} \pi_i = 1 \Big\}.$$

Considering observations from different individuals
to be statistically independent, we have
$X \mid \theta \sim \mathrm{Multinomial}(n, \gamma)$ and $Y \mid \theta \sim \mathrm{Multinomial}(m, \pi)$, with
$X$ and $Y$ being conditionally independent as well. The
likelihood function is then given by

$$L(\theta; x, y) \propto \prod_{i \in G} \gamma_i^{x_i} \prod_{i \in G} \pi_i^{y_i}, \quad \theta \in \Theta,$$

which is the product of two multinomial distributions.

The first hypothesis to be tested (null hypothesis),
namely that there is no difference in genotypic frequencies
between the groups, may be formally expressed as

$$H_0^G : \gamma = \pi. \quad (1)$$

The usual procedure to test $H_0^G$ is the chi-square test,
i.e., the test based on the statistic

$$Q^G = \sum_{i \in \{AA, AB, BB\}} \left[ \frac{(X_i - \hat{X}_i^G)^2}{\hat{X}_i^G} + \frac{(Y_i - \hat{Y}_i^G)^2}{\hat{Y}_i^G} \right], \qquad \hat{X}_i^G = n\hat{\theta}_i^G, \quad \hat{Y}_i^G = m\hat{\theta}_i^G,$$

where $\hat{\theta}_i^G$ is the maximum likelihood estimator for the
genotypic frequency $i$ under the hypothesis $H_0^G$. Under
$H_0^G$, $Q^G$ has asymptotic distribution $\chi^2_2$ (chi-square
distribution with 2 degrees of freedom). Using this fact, it is
possible to calculate an asymptotic p-value. If one prefers
exact tests, Monte Carlo methods can also be used. To
sum up, in order to test the first hypothesis, one usually
applies a traditional chi-square test of homogeneity
to Table 3.



The second hypothesis states that there is no difference
in allelic frequencies between the groups. This
hypothesis, which will be made formal in the next
section, is usually tested by considering the allelic
frequencies in both samples, $X_A = 2X_{AA} + X_{AB}$ and
$Y_A = 2Y_{AA} + Y_{AB}$, as in Table 4, and applying the chi-square
test of homogeneity to that table, which has twice as many
observations as Table 3.

More formally, the statistic considered is

$$Q^A = \sum_{i \in \{A, B\}} \left[ \frac{(X_i - \hat{X}_i^A)^2}{\hat{X}_i^A} + \frac{(Y_i - \hat{Y}_i^A)^2}{\hat{Y}_i^A} \right], \qquad \hat{X}_i^A = 2n\hat{\kappa}_i^A, \quad \hat{Y}_i^A = 2m\hat{\kappa}_i^A,$$

where $\hat{\kappa}_i^A$ is the maximum likelihood estimator for the
allelic frequency $i$ under the hypothesis that allelic
frequencies are the same in both groups. This statistic is
then compared to a $\chi^2_1$ distribution, or sampled using a
Monte Carlo method to calculate the p-value. However,
in this scenario, the distribution of the test statistic under
the null hypothesis is not chi-square unless alleles are
statistically independent. In other words, the distribution is
chi-square only if a product multinomial model can be
applied to Table 4. Essentially, this independence
corresponds to HWE. In fact, [9] formally proves that this
is a valid test if, and only if, both groups, case and control,
are under HWE. Otherwise, this test is biased: the nominal
level of significance differs from the real one [17]. [17]
also shows how deviations from HWE alter type-1 error
rates, a point that will also be illustrated in Section Results
and Discussion. Therefore, this test should not be used.
It is important to note that, despite being wrong, it is still
widely used in the genetic literature (see e.g. [18],
which also discusses some aspects of the lack of robustness
of this test). This leads to a larger number of false
conclusions than the nominal error rates of the procedures suggest.

Table 3 Population genotypic frequencies

Group      AA                      AB                      BB                      Total
Case       x_AA (\gamma_AA)        x_AB (\gamma_AB)        x_BB (\gamma_BB)        n
Control    y_AA (\pi_AA)           y_AB (\pi_AB)           y_BB (\pi_BB)           m

Genotypic frequencies (probabilities).

Table 4 Population allelic frequencies

Group      A                       B                       Total
Case       x_A = 2x_AA + x_AB      x_B = 2x_BB + x_AB      2n
Control    y_A = 2y_AA + y_AB      y_B = 2y_BB + y_AB      2m

Allelic frequencies from Table 3.

Applying the traditional tests to data from Table 1, one
gets a p-value of 0.152 for genotypic association and of
0.049 for allelic association. This means that the evidence
we have that the two groups are in genotypic homogeneity
is larger than the evidence we have that they are in
allelic homogeneity. However, if genotypic proportions are
the same, allelic proportions must also be the same. This
implication will be made formal in Section Results and
Discussion. In practice, the first p-value being larger than
the second implies that one can accept the hypothesis of
genotypic homogeneity while rejecting allelic homogeneity,
which is a contradiction. For instance, this is the case
when the level of significance is 10%, as 0.152 > 0.1 but
0.049 < 0.1. To summarize: we are testing two nested
hypotheses, that is, the nature of the problem is such that
the first hypothesis implies the second. However, even
though we reject the second, we do not reject the first.
Does the contradiction happen because the allelic test is
wrong? The next section answers this by presenting an exact
test for allelic homogeneity that is valid even if HWE does
not hold.
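The two traditional p-values quoted above can be reproduced with a short sketch. It assumes SciPy's `chi2_contingency` with the continuity correction switched off (an assumption about how the quoted values were obtained; with the correction enabled the allelic p-value would differ), and it uses the asymptotic chi-square approximation rather than Monte Carlo sampling.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Genotype counts from Table 1 (rows: case, control; columns: AA, AB, BB)
genotypes = np.array([[55, 83, 50],
                      [24, 42, 39]])

# Allele counts as in Table 2 (columns: A, B)
alleles = np.column_stack([2 * genotypes[:, 0] + genotypes[:, 1],
                           2 * genotypes[:, 2] + genotypes[:, 1]])

# Pearson chi-square tests of homogeneity, without Yates' continuity correction
q_geno, p_geno, dof_geno, _ = chi2_contingency(genotypes, correction=False)
q_allele, p_allele, dof_allele, _ = chi2_contingency(alleles, correction=False)

print(f"genotypic: Q = {q_geno:.3f}, df = {dof_geno}, p = {p_geno:.3f}")
print(f"allelic:   Q = {q_allele:.3f}, df = {dof_allele}, p = {p_allele:.3f}")

# At the 10% level the allelic hypothesis is rejected while the genotypic one is not
print(p_geno > 0.10 > p_allele)
```

The last line makes the incoherence concrete: the weaker (implied) hypothesis is rejected while the stronger one is retained.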

A Different Frequentist Test 

Some attempts to correct the above-mentioned allelic test
so that it works even when the HWE assumption is not met
are considered by [8,12,17,19]. See [18] for a summary of
these. Also, see [20], which proves that the test proposed by
[17] to correct for departures from HWE is equivalent to the
one proposed by [9], which is Armitage's trend test [5].

Here we show another solution that has the advantages
of being exact and unconditional, and that can also be
calculated in a computationally efficient way, even for large data
sets. Moreover, it is defined in the same parametric space
$\Theta$ as the genotypic test. Essentially, this test is derived by
noticing that the hypothesis that allele frequencies are the
same in both groups can be written in terms of the original
parametric space as

$$H_0^A : \gamma_{AA} + \tfrac{1}{2}\gamma_{AB} = \pi_{AA} + \tfrac{1}{2}\pi_{AB}. \quad (2)$$

Note that this formulation is always true, independent
of the Hardy-Weinberg equilibrium restriction, and does
not involve changing either the sample space or the
parametric space.

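The chi-square statistic may be used to test this
hypothesis:

$$Q^{A*} = \sum_{i \in \{AA, AB, BB\}} \left[ \frac{(X_i - \hat{X}_i^{A*})^2}{\hat{X}_i^{A*}} + \frac{(Y_i - \hat{Y}_i^{A*})^2}{\hat{Y}_i^{A*}} \right], \qquad \hat{X}_i^{A*} = n\hat{\gamma}_i^{A*}, \quad \hat{Y}_i^{A*} = m\hat{\pi}_i^{A*}.$$

Here, $\hat{\gamma}_i^{A*}$ and $\hat{\pi}_i^{A*}$ are the maximum likelihood
estimators of genotypic frequencies under $H_0^A$. They can be
found by maximizing

$$L(\theta; x, y) \propto \left(\pi_{AA} + \tfrac{1}{2}\pi_{AB} - \tfrac{1}{2}\gamma_{AB}\right)^{x_{AA}} \gamma_{AB}^{x_{AB}} \left(1 - \pi_{AA} - \tfrac{1}{2}\pi_{AB} - \tfrac{1}{2}\gamma_{AB}\right)^{x_{BB}} \times \pi_{AA}^{y_{AA}} \pi_{AB}^{y_{AB}} \left(1 - \pi_{AA} - \pi_{AB}\right)^{y_{BB}} \quad (3)$$

and then using the relations

$$\gamma_A^* = \gamma_{AA} + \tfrac{1}{2}\gamma_{AB}; \quad \gamma_B^* = \gamma_{BB} + \tfrac{1}{2}\gamma_{AB}; \quad \pi_A^* = \pi_{AA} + \tfrac{1}{2}\pi_{AB}; \quad \pi_B^* = \pi_{BB} + \tfrac{1}{2}\pi_{AB}.$$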



Maximization of Equation (3) can be efficiently done by
using numerical methods such as Newton's method [21],
which are already implemented in most statistical and
mathematical software such as R and MATLAB. To
calculate p-values, the statistic $Q^{A*}$ can then be compared to
a $\chi^2_1$ distribution or, if one wishes to perform an exact test
(the approach we take here), sampled using Monte Carlo
methods. That is, one can generate several values of $Q^{A*}$
under the null hypothesis and compute the proportion
of these that are larger than the statistic observed in the
sample. This is the (estimate of the) exact p-value.
Confidence intervals can be obtained for it by using a normal
approximation to the binomial distribution. Note that the
dimension of the parametric space is 4 and under the null
hypothesis it becomes 3. Hence, the number of degrees of
freedom of the distribution of the chi-square statistic is
$\dim(\Theta) - \dim(H_0^A) = 4 - 3 = 1$. This is also the number
of degrees of freedom of the chi-square for the allelic test
described before.
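A minimal sketch of this test follows, under stated assumptions: it computes the constrained maximum likelihood estimates numerically and returns only the asymptotic $\chi^2_1$ p-value, not the Monte Carlo exact p-value used in the paper. The reparameterization (a common allele frequency plus one heterozygote probability per group), the Nelder-Mead optimizer and the names `constrained_point` and `corrected_allelic_test` are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit
from scipy.stats import chi2


def constrained_point(z):
    """Map an unconstrained vector z to (gamma, pi) with equal allele frequencies."""
    p = expit(z[0])                      # common frequency of allele A in both groups
    cap = 2.0 * min(p, 1.0 - p)          # largest admissible heterozygote probability
    g, h = cap * expit(z[1]), cap * expit(z[2])
    gamma = np.array([p - g / 2, g, 1 - p - g / 2])  # case: (AA, AB, BB)
    pi = np.array([p - h / 2, h, 1 - p - h / 2])     # control: (AA, AB, BB)
    return gamma, pi


def neg_loglik(z, x, y):
    gamma, pi = constrained_point(z)
    return -(x @ np.log(gamma) + y @ np.log(pi))


def corrected_allelic_test(x, y):
    """Pearson statistic with expected counts fitted under equal allele frequencies."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, m = x.sum(), y.sum()
    pooled = (x + y) / (n + m)           # starting point: pooled genotype frequencies
    p0 = pooled[0] + pooled[1] / 2
    r0 = pooled[1] / (2 * min(p0, 1 - p0))
    z0 = np.array([logit(p0), logit(r0), logit(r0)])
    res = minimize(neg_loglik, z0, args=(x, y), method="Nelder-Mead",
                   options={"xatol": 1e-9, "fatol": 1e-9, "maxiter": 5000})
    gamma_hat, pi_hat = constrained_point(res.x)
    ex, ey = n * gamma_hat, m * pi_hat   # expected genotype counts under the null
    q = ((x - ex) ** 2 / ex).sum() + ((y - ey) ** 2 / ey).sum()
    return q, chi2.sf(q, df=1)           # asymptotic p-value with 1 degree of freedom


q_stat, p_value = corrected_allelic_test([55, 83, 50], [24, 42, 39])
print(f"Q = {q_stat:.3f}, asymptotic p = {p_value:.3f}")
```

An exact version would replace the last step of `corrected_allelic_test` by simulating pairs of multinomial samples from the fitted null model and recomputing the statistic for each pair, at the cost of one optimization per simulated sample.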

This test is very similar to the ones recently introduced 
by [7], except that the statistics used are different (Wald 
statistic, score statistic and maximum profile likelihood 
ratio), and results are asymptotic: chi squared approxima- 
tion is used. Even though these tests are asymptotically 
equivalent, in order to illustrate our points it is important 
to have exact tests here. 

The allelic p-value for data from Table 1 is 0.069. It is
surprising that, despite the fact that this test is correct, this
p-value is still smaller than 0.152, the p-value for genotypic
association. Hence, incoherence remains even when
correcting the traditional allelic test. We note that the
p-value found by [12] for this same data set using a corrected
allelic test is 0.066, which also does not remove the
contradiction. Note that here we use the exact test, hence this
is not a problem of using an approximation. In Section
Results and Discussion we present other data sets in which
this incoherence happens, showing that this problem is
not unique to the particular data we chose to illustrate
the point. The next section presents a framework
where this kind of contradiction does not happen.

Bayesian Solution 

Bayesian methods are the alternative inductive way to
deal with such a problem. These methods are widely used
nowadays because they allow prior knowledge from the
researcher and scientific community to be incorporated
into the analysis (see [22] for applied examples of these
methods in genetics) and, contrary to usual classical
procedures, they do not require large samples for the analysis
to be correct. That is, optimality of the procedure does
not rely on asymptotic considerations. Many Bayesian
methods designed to deal with precise hypotheses, i.e.,
hypotheses which have lower dimension than the
parametric space, have been developed. Precise hypotheses
must have a different treatment in Bayesian statistics: in
general, they have zero posterior probability, so that they
would always be rejected when using traditional methods.
One way to deal with this problem is to assign a positive
prior probability to the null hypothesis [23], but this may seem
a rather ad hoc solution and may lead to some
inconsistencies [24]. Another approach is to use Bayes factors [25],
a point to which we will return later in the paper (see
Section Results and Discussion).

In this paper, we choose to use the FBST (Full Bayesian
Significance Test), a procedure introduced by [26]. This
method was also used by [27]. The test is based on the
e-value statistic, a Bayesian measure of evidence designed
to evaluate sharp null hypotheses. In order to apply this
method, we begin by specifying a prior density in the
complete parametric space $\Theta$, $f(\theta)$. We note that it is not
necessary to attribute a different probability to each of the
hypotheses: it is only necessary to specify $f(\theta)$. This is not
the case for Bayes factors, where specification of a different
probability distribution inside each of the hypotheses of
interest is needed. After observing data $x$, let $f(\theta \mid x)$ be the
posterior density of the parameter $\theta$. The posterior density
is given by

$$f(\theta \mid x) \propto f(\theta) L(\theta; x).$$

Suppose one is interested in testing the null hypothesis
$H: \theta \in \Theta_0$. Define the Tangential Set to the null
hypothesis as

$$T_x = \{\theta \in \Theta : f(\theta \mid x) > \sup_{\Theta_0} f(\theta \mid x)\}.$$

The measure of evidence proposed, the e-value, is
defined by

$$ev_x(H) = 1 - P(\theta \in T_x \mid x).$$

In words, the e-value is the posterior probability of the
subset of the parametric space consisting of points whose
posterior density does not exceed the maximum achieved under H.
It is interesting to note the duality between p-values and
e-values: while the former are tail areas in the sampling
distribution from the observed values under the null
hypothesis, the latter are tail areas in the posterior distribution
from the sharp hypothesis. E-values are easy to
calculate, and successful papers that use the FBST procedure
in genetics include [14,28,29]. For more on e-values, the
intuition behind them, asymptotic consistency results, and
decision-theoretic considerations see [30]. High e-values
indicate high evidence in favor of the hypothesis, while
low e-values indicate evidence against it.

Implementation of the FBST procedure requires two 
simple steps, which can be performed numerically: 

• Optimization - Finding the supremum of the
posterior distribution under the null hypothesis,
$\sup_{\theta \in \Theta_0} f(\theta \mid x)$. This is usually done by using built-in
functions from statistical packages such as R.

• Integration - Integrating the posterior density over 
the Tangential Set, T^. This step can be done by 
sampling from the posterior distribution by using 
methods such as MCMC. For the problem 
considered here, a usual Monte Carlo method is 
enough to efficiently sample from the posterior. 
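The two steps above can be sketched in Python for the two homogeneity hypotheses. This is a minimal illustration under stated assumptions and not the routine provided by the authors: it assumes uniform Dirichlet priors (so the posterior density is proportional to the likelihood), plain Monte Carlo sampling with NumPy, and a Nelder-Mead search from SciPy for the allelic null. The helper names `log_density`, `constrained_point` and `fbst_evalues` are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit


def log_density(gamma, pi, x, y):
    """Unnormalised log posterior under uniform priors (equals the log-likelihood)."""
    return (x * np.log(gamma)).sum(axis=-1) + (y * np.log(pi)).sum(axis=-1)


def constrained_point(z):
    """Map unconstrained z to (gamma, pi) with equal allele frequencies (allelic null)."""
    p = expit(z[0])
    cap = 2.0 * min(p, 1.0 - p)
    g, h = cap * expit(z[1]), cap * expit(z[2])
    gamma = np.array([p - g / 2, g, 1 - p - g / 2])
    pi = np.array([p - h / 2, h, 1 - p - h / 2])
    return gamma, pi


def fbst_evalues(x, y, n_draws=200_000, seed=0):
    """Return (e-value for genotypic homogeneity, e-value for allelic homogeneity)."""
    x, y = np.asarray(x, float), np.asarray(y, float)

    # Optimization, genotypic null (gamma = pi): closed-form maximiser.
    theta = (x + y) / (x + y).sum()
    sup_g = log_density(theta, theta, x, y)

    # Optimization, allelic null: numerical search started at theta.
    p0 = theta[0] + theta[1] / 2
    r0 = theta[1] / (2 * min(p0, 1 - p0))
    z0 = np.array([logit(p0), logit(r0), logit(r0)])
    res = minimize(lambda z: -log_density(*constrained_point(z), x, y), z0,
                   method="Nelder-Mead",
                   options={"xatol": 1e-9, "fatol": 1e-9, "maxiter": 5000})
    # The genotypic null is contained in the allelic null, so sup_g is a lower bound.
    sup_a = max(-res.fun, sup_g)

    # Integration: sample the posterior (two independent Dirichlet distributions)
    # and estimate the posterior mass outside each tangential set.
    rng = np.random.default_rng(seed)
    gamma = rng.dirichlet(x + 1, size=n_draws)
    pi = rng.dirichlet(y + 1, size=n_draws)
    lp = log_density(gamma, pi, x, y)
    return float(np.mean(lp <= sup_g)), float(np.mean(lp <= sup_a))


ev_geno, ev_allele = fbst_evalues([55, 83, 50], [24, 42, 39])
print(f"e-value genotypic: {ev_geno:.3f}, e-value allelic: {ev_allele:.3f}")
```

Because both e-values are estimated from the same posterior draws and the supremum over the allelic null is never below the one over the genotypic null, the estimated e-values are ordered in the same way for any data set.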

More details on the implementation of the FBST proce- 
dure can be found in [26]. To perform the complete FBST 
procedure one also needs to set a cut-off point, that is, one 
must say what a "small" e-value means. Several approaches 
are available: 

• Empirical power analysis [31] 

• Reference sensitivity analysis and paraconsistent 
logic [32]. 

• [30] relate e-values to p-values. 

• Bayesian decision-theoretic approach [33], by the
specification of a loss function that gives rise to the
FBST procedure.

• An asymptotically consistent threshold for a given 
confidence level ([34] and [31]). 

The prior distribution for $\gamma$ in the routine that was
implemented and is available in the website is a Dirichlet
distribution, as is the prior distribution for $\pi$. The
family of Dirichlet priors is widely used in this
scenario since it is both broad enough to represent
a wide range of prior information
and yet very easy to deal with both mathematically
and computationally. Here, the two priors are considered
to be independent, and in the implementation we
provide online (see Conclusions) the (hyper)parameters
$(a_{AA}, a_{AB}, a_{BB}, b_{AA}, b_{AB}, b_{BB})$ are set by the user. That is,

$$f(\theta) \propto \prod_{i \in G} \gamma_i^{a_i - 1} \prod_{i \in G} \pi_i^{b_i - 1}, \quad \theta \in \Theta.$$

Note that in this case the posterior distribution is also
the product of two independent Dirichlet distributions
(since they are conjugate with the multinomial distribution).
Their parameters are $(x_{AA} + a_{AA}, x_{AB} + a_{AB}, x_{BB} + a_{BB})$
and $(y_{AA} + b_{AA}, y_{AB} + b_{AB}, y_{BB} + b_{BB})$ respectively.
Simulation of the Dirichlet distribution can be efficiently
done by sampling from Gamma distributions; see [35] for
details. Note that when all (hyper)parameters ($a_i$
and $b_i$) are equal to 1, $\theta$ is uniformly distributed a priori.
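The Gamma construction mentioned above can be sketched as follows: independent Gamma variables with shape parameters equal to the Dirichlet parameters, divided by their sum, follow the corresponding Dirichlet distribution. The sketch assumes NumPy; the function name `dirichlet_via_gamma` is hypothetical, and the example parameters are the posterior parameters of the case group in Table 1 under a uniform prior.

```python
import numpy as np


def dirichlet_via_gamma(alpha, size, rng):
    """Draw `size` Dirichlet(alpha) vectors by normalising independent Gamma draws."""
    alpha = np.asarray(alpha, float)
    g = rng.gamma(shape=alpha, scale=1.0, size=(size, alpha.size))
    return g / g.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
posterior_case = np.array([55, 83, 50]) + 1   # counts x_i plus prior parameters a_i = 1
draws = dirichlet_via_gamma(posterior_case, 100_000, rng)

print(draws.mean(axis=0))                     # close to posterior_case / posterior_case.sum()
print(posterior_case / posterior_case.sum())
```

In practice `numpy.random.Generator.dirichlet` gives the same distribution directly; the explicit version is shown only to make the construction visible.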

The FBST procedure can be used in general, not only for
testing allelic homogeneity. In particular, it can be used to
test Hardy-Weinberg equilibrium, as shown by [26] and
[14]. In order to illustrate the procedure, Figure 1 shows
the Hardy-Weinberg hypothesis line and the Tangential
Sets for both case and control groups from the data set
presented by [16]. Figure 2 does the same for simulated
data (not under Hardy-Weinberg equilibrium). They also
show the 99% HPD (Highest Posterior Density) sets. For
the sake of neutrality, the prior distribution we use is the
product of two independent Dirichlet distributions with
parameters (1, 1, 1), i.e., the uniform distribution in $\Theta$.
Hence, the posterior distribution is proportional to the
likelihood function, that is,

$$f(\theta \mid x, y) \propto \prod_{i \in G} \gamma_i^{x_i} \prod_{i \in G} \pi_i^{y_i}.$$

We see that while both groups from Figure 1 seem to
be under HWE (in this case, tangential sets have small
probabilities, and therefore e-values are large), the ones
from Figure 2 seem to be far from the equilibrium (in this
case, tangential sets have large probabilities, and therefore
e-values are small).

When testing genotypic and allelic homogeneity using
FBST and uniform priors ($a_i = b_i = 1$ for all $i$ in the
Dirichlet distribution), we obtain e-values of 0.434 and
0.493 respectively. Hence, contrary to what happens with
p-values, there is more evidence in favor of the allelic
homogeneity hypothesis than there is in favor of the genotypic
homogeneity hypothesis. Therefore the contradiction of
not rejecting the first hypothesis while rejecting the
second one cannot happen for any cutoff that is chosen. In
fact, as we will show in the next Section, this is a property
of the FBST procedure: the undesirable contradiction can
never happen.

Results and Discussion 

We begin this Section by summarizing the results of
the analyses for data presented in [16], which were
presented during the exposition of the concepts throughout
the paper. Results are shown in Table 5. The notation
for this table is as follows: $p^G$ is the traditional
p-value for genotypic association; $e^G$ is the e-value for
genotypic association; $p^A$ is the usual (wrong) p-value for
allelic association; $p^{A*}$ is the p-value for allelic association
proposed in this paper; and $e^A$ is the e-value for allelic
association. This table also includes the results (both
p-values and e-values) for the test of the hypothesis of HWE
for case and control groups.

Table 5 Analysis of real data

Genotypes        Alleles                     Hardy-Weinberg (case)   Hardy-Weinberg (control)
p^G     e^G      p^A     p^{A*}   e^A        p^HW    e^HW            p^HW    e^HW
0.152   0.434    0.049   0.069    0.493      0.111   0.276           0.060   0.165

Significance indices for homogeneity for data presented in Table 2.

Figure 2 Full Bayesian Significance Test for HWE: simulated data. Geometric representation of the HWE hypothesis (green curve), FBST
tangential set (continuous ellipse) and 99% credible set (dashed ellipse): data from simulated samples (case 26 from Table 6).
Panels: Case (horizontal axis $\gamma_{AA}$) and Control (horizontal axis $\pi_{AA}$).

Because $H_0^G$ implies $H_0^A$, it is reasonable to expect that
p-values, as well as any other measure of evidence, should be such that
$p(H_0^A) \geq p(H_0^G)$. To sum up, there should be at least as much
evidence in favor of $H_0^A$ as in favor of $H_0^G$. In fact, this is
what motivates the tests proposed by [7]. More generally,
if we have two nested hypotheses, $A \subseteq B \subseteq \Theta$, it would
be desirable to have $p(B) \geq p(A)$. That is, one should
always believe that B is at least as plausible as A. It is worth
noting that this inequality must hold if one wants to
guarantee that for any significance level $\alpha$ the rejection of B
will imply the rejection of A. In other words, $p(B)$ should
never be smaller than $p(A)$, so that one will never
conclude that A holds but B does not, which is, as we showed,
logically impossible.

Even though this logical coherence is desirable, the
analysis of data presented by [16] (Table 5) shows that this
property is achieved neither when using the traditional
p-value for allelic frequencies, nor when using the
alternative test presented here. Hence, depending on the
level of significance used (for example, 10%), one can
conclude that genotypic homogeneity holds, but allelic
homogeneity does not. This leads to a logical contradiction
that may be embarrassing for researchers when
presenting their results to the scientific community. Some authors
(e.g. [36-39]) have already noticed that p-values cannot be
used as a measure of evidence because they do not respect
this property. Attempts to correct frequentist tests so that
they are coherent have been made in some specific
situations such as Analysis of Variance [40], but no general
procedure could be obtained.

On the other hand, e-values are monotone in the set of
all possible hypotheses. This can be seen by noting that

$$A \subseteq B \;\Rightarrow\; \sup_{A} f(\theta \mid x) \leq \sup_{B} f(\theta \mid x) \;\Rightarrow\; T_x(B) \subseteq T_x(A) \;\Rightarrow\; ev_x(A) \leq ev_x(B), \quad (4)$$

where $T_x(H)$ denotes the tangential set to hypothesis $H$.

For the problem considered here, this means that
$ev_x(H_0^G) \leq ev_x(H_0^A)$ will hold for all datasets. Hence, one
will always have at least as much evidence in favor of $H_0^A$ as
in favor of $H_0^G$, and therefore when performing the FBST
procedure (that is, comparing the e-values with a given
cutoff) one will never fall into the logical contradiction of
rejecting $H_0^A$ while not rejecting $H_0^G$. Equation 4 proves
that the incoherence can never happen when using the
FBST. Table 5 shows that this inequality indeed holds for
the data presented. It is also interesting to note that in
the case of nested hypotheses, FBST provides an intrinsic
penalty that can be used for model selection [41].



In Table 6, one can find similar results on simulated 
data. Data was simulated in three different conditions: 
1 - under genotypic homogeneity (and, therefore, allelic 
homogeneity), 2 - under only allelic homogeneity and 
3 - under neither allelic nor genotypic homogeneity. Bold 
p-values indicate situations in which there is incoherence 
in the sense described here. Note that, as it was expected 
due to the proof that was given, none of the samples have 
incoherence when using analyses provided by e-values. 

As mentioned before, $H_0^G$ implies $H_0^A$ in the sense that
if genotypic frequencies are the same in both groups then
allelic frequencies must also be the same. In other words,
it is impossible for the allelic frequencies to be different if
the genotypic frequencies are equal. This can be formally
seen by noting that

$$H_0^G \text{ true} \;\Rightarrow\; \gamma_i = \pi_i \ \forall i \in G \;\Rightarrow\; \gamma_{AA} + \tfrac{1}{2}\gamma_{AB} = \pi_{AA} + \tfrac{1}{2}\pi_{AB} \;\Rightarrow\; H_0^A \text{ true}.$$

An important question is why we use the FBST methodology
rather than standard Bayes factors, the traditional
Bayesian procedure to test sharp hypotheses [25]. The
reason is that, contrary to e-values, Bayes factors are also not
monotonic when dealing with sharp hypotheses, as we will
show here. In order to calculate Bayes factors, one must
first assign a probability distribution for the parameters
under each of the hypotheses of interest. In the problem
we deal with, this means it is necessary to assign
probabilities for $\theta$ under $\Theta$, $H_0^G$ and $H_0^A$. The Bayes factor
for hypothesis $H$ is then defined as in [38]. For
the real dataset presented in [16] (Table 1), when using
uniform probabilities for $\theta$ in $\Theta$, $H_0^G$ and $H_0^A$, we have
a Bayes factor of 6.63 in favor of $H_0^G$, but only 0.28 in
favor of $H_0^A$, so that lack of monotonicity remains. The
main reason for this is that it is not necessarily true that
$P(\text{data} \mid H_0^G) \leq P(\text{data} \mid H_0^A)$. See [38] for a different
example where this happens. An informal explanation of the
lack of monotonicity is given by [38]: "What the Bayes
factor actually measures is the change in odds in favor
of the hypothesis when going from the prior to the
posterior". Note that even though they are not monotonic,
Bayes Factors provide a great tool for model selection
[42], a point which we further discuss in the conclusions.
One may also argue about the merits of using FBST as a
genuine Bayesian procedure rather than traditional Bayes
factors. We advocate that while Bayes factors are primarily
Table 6 Analysis of simulated data

      Genotypes        Alleles                      Hardy-Weinberg
                                                    Case              Control
#     p^G     e^G      p^A     p^{A*}   e^A         p^HW    e^HW      p^HW    e^HW

Genotypic Homogeneity
1     0.408   0.773    0.197   0.189    0.786       0.540   0.832     0.819   0.971
2     0.588   0.897    0.648   0.684    0.997       0.030   0.090     0.001   0.002
3     0.478   0.826    0.483   0.510    0.980       0.496   0.793     0.035   0.119
4     0.912   0.996    0.709   0.689    0.997       0.172   0.377     0.122   0.287
5     0.836   0.985    0.578   0.554    0.985       0.224   0.464     0.170   0.378
6     0.989   1.000    0.926   0.903    1.000       0.000   0.000     0.000   0.000
7     0.187   0.494    0.100   0.068    0.498       0.027   0.081     0.044   0.124
8     0.652   0.929    0.444   0.416    0.953       0.338   0.626     0.104   0.257
9     0.620   0.916    0.510   0.494    0.976       0.192   0.422     0.761   0.955
10    0.565   0.888    0.923   0.912    1.000       0.001   0.003     0.057   0.153

Allelic Homogeneity
11    0.008   0.034    0.325   0.291    0.893       0.494   0.790     0.001   0.003
12    0.000   0.000    0.06/   O.Ob/    0.--1-12    0.008   0.190     0.000   0.000
13    0.002   0.013    0.151   0.114    0.629       0.989   1.000     0.000   0.000
14    0.001   0.003    0.923   0.918    1.000       0.174   0.400     0.000   0.000
15    0.113   0.342    0.844   0.833    1.000       0.989   1.000     0.006   0.014
16    0.020   0.086    0.559   0.547    0.985       0.174   0.395     0.015   0.040
17    0.001   0.006    0.147   0.129    0.683       0.129   0.319     0.002   0.005
18    0.040   0.149    0.501   0.462    0.970       0.871   0.986     0.001   0.002
19    0.026   0.106    1.000   1.000    1.000       0.760   0.955     0.000   0.000
20    0.001   0.002    0.446   0.379    0.939       0.733   0.938     0.000   0.000

No Homogeneity
21    0.000   0.000    0.925   0.928    1.000       0.000   0.000     0.015   0.045
22 


0.843 


0.987 


0.646 


0.618 


0.993 


0,055 


0.1 53 


0.141 


0.333 


23 


0.062 


0.219 


0.104 


0.124 


0.661 


0,989 


1.000 


0.007 


0.028 


24 


0.669 


0.939 


0.403 


0.408 


0.955 


0.994 


1.000 


0.621 


0.882 


25 


0.000 


0.000 


0.000 


0.000 


0.003 


0,000 


0.000 


0.771 


0.958 


26 


0.105 


0.331 


0.017 


0.047 


0.403 


0,001 


0.001 


0.001 


0.001 


27 


0.000 


0.000 


0.000 


0.000 


0.012 


0,072 


0.197 


0.010 


0.033 


28 


0.180 


0.485 


0.230 


0.233 


0.835 


0,310 


0.598 


0.324 


0.602 


29 


0.134 


0.387 


0.068 


0.045 


0.389 


0.045 


0.128 


0.063 


0.170 


30 


0.807 


0.980 


0.522 


0.517 


0.980 


0,806 


0.971 


0.713 


0.933 



Results of the simulations under three different scenarios: genotypic homogeneity, allelic (but not genotypic) homogeneity and no homogeneity at all. Bold p values 
indicate incoherence. 



motivated by the epistemological framework of Decision
Theory and p-values are supported by Popperian falsificationism,
e-values and FBST are supported by the framework
of Cognitive Constructivism. The reader is referred
to [43-45] for more epistemological considerations and
comparisons of these methods. It is also interesting that
FBST can be justified as a minimization procedure of
a loss function, as shown by [33]. This makes e-values also
compatible with standard Decision Theory and therefore
with traditional Bayesian statistics. We emphasize that whenever
hypotheses are not sharp, posterior probabilities are
usually more adequate.
We end this Section by answering the question
of whether the FBST procedure has good power properties.
Even though this is not of primary interest in this
work and is not a relevant question for most orthodox
Bayesians, we indicate that this Bayesian procedure has
good frequency properties. In order to do this, we fix different
values for $\gamma_{AA}$, $\gamma_{AB}$ and $\pi_{AA}$. We then set $\pi_{AB}$ to
be $2(\gamma_{AA} + \frac{1}{2}\gamma_{AB} - \pi_{AA} - \epsilon)$ for different values of $\epsilon$, where $\epsilon$
quantifies how far from allelic homogeneity the population
is. The particular case $\epsilon = 0$ corresponds to allelic
homogeneity, that is, to a true hypothesis. For each value
of $\epsilon$, we simulate 100 data sets, each with 100 cases
and 100 controls. We then calculate the
proportion of data sets in which allelic homogeneity was
rejected according to each criterion. We use levels of significance
of 5% and 10%. The relationship provided by [30]
is used to determine the cutoffs for e-values that make
FBST have the desired level of significance. Results are
shown in Figure 3. These graphs indicate that the usual
test for allelic homogeneity has a larger power than the
others. This conclusion is misleading, since the size of the
test is not the nominal one, as we discuss in Section Usual
Procedures. This can be seen by looking at the curve at
$\epsilon = 0$ and noting that the power (which for $\epsilon = 0$ is
the size of the test) is larger than 5% and 10%, respectively.
For more simulations regarding the power of this test, the
reader is referred to [17]. This figure also shows that the
power of the frequentist allelic test proposed here and that of the
FBST are virtually the same: even though the FBST structure
guarantees coherence in the results and frequentist
tests do not have this property, their powers are very close
to each other. Hence, the FBST procedure also has good
frequency properties.
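The simulation design described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the routine used to produce Figure 3: it implements only the usual allelic test (a chi-squared test with one degree of freedom on the table of doubled allele counts), and the function names, the seed and the default arguments are hypothetical choices made for this example.

```python
import math
import random


def control_frequencies(gamma_aa, gamma_ab, pi_aa, eps):
    """Control genotype frequencies (AA, AB, BB) at distance eps from allelic homogeneity."""
    pi_ab = 2 * (gamma_aa + gamma_ab / 2 - pi_aa - eps)
    return pi_aa, pi_ab, 1 - pi_aa - pi_ab


def draw_genotypes(freqs, n, rng):
    """Draw n genotypes and return the counts of (AA, AB, BB)."""
    draws = rng.choices(range(3), weights=freqs, k=n)
    return [draws.count(g) for g in range(3)]


def usual_allelic_p_value(cases, controls):
    """Chi-squared test (1 df) on the 2x2 table of allele counts."""
    a1, b1 = 2 * cases[0] + cases[1], 2 * cases[2] + cases[1]
    a2, b2 = 2 * controls[0] + controls[1], 2 * controls[2] + controls[1]
    total = a1 + b1 + a2 + b2
    denom = (a1 + b1) * (a2 + b2) * (a1 + a2) * (b1 + b2)
    if denom == 0:
        return 1.0  # degenerate table: no evidence against homogeneity
    stat = total * (a1 * b2 - a2 * b1) ** 2 / denom
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2))


def rejection_rate(gamma, pi, n=100, reps=100, alpha=0.05, seed=0):
    """Proportion of simulated data sets in which allelic homogeneity is rejected."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    rejected = 0
    for _ in range(reps):
        cases = draw_genotypes(gamma, n, rng)
        controls = draw_genotypes(pi, n, rng)
        rejected += usual_allelic_p_value(cases, controls) < alpha
    return rejected / reps


gamma = (0.2, 0.4, 0.4)  # gamma_AA = 1/5, gamma_AB = 2/5, as in the top panels
power_curve = {
    eps: rejection_rate(gamma, control_frequencies(0.2, 0.4, 0.25, eps))
    for eps in (0.0, 0.05, 0.10)
}
```

Evaluating `rejection_rate` at `eps = 0` estimates the size of the test, and at `eps > 0` its power; with only 100 replications per point these estimates are noisy.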

Conclusions 

Although the traditional approach of doubling the sample
size to test the allelic homogeneity hypothesis has already
been shown to be incorrect when Hardy-Weinberg equilibrium
is not met, many recent articles in biology still use
it. As Figure 3 illustrates through power functions,
the nominal level of significance of the usual allelic
test is not attained: at zero on the x-axis, its power is
larger than 5%, contrary to the alternative tests. We have
shown in this paper that a logical inconsistency that happens
when using such a procedure remains even when
using adjusted frequentist tests. The main point of this






Figure 3 Power analysis of Full Bayesian Significance Test. Comparison of power of different tests for allelic homogeneity. Horizontal lines show
level of significance. Top left: $\gamma_{AA} = 1/5$, $\gamma_{AB} = 2/5$, $\pi_{AA} = 1/4$, $\alpha = 5\%$; top right: $\gamma_{AA} = 1/5$, $\gamma_{AB} = 2/5$, $\pi_{AA} = 1/4$, $\alpha = 10\%$; bottom left:
$\gamma_{AA} = 1/3$, $\gamma_{AB} = 1/5$, $\pi_{AA} = 1/3$, $\alpha = 5\%$; bottom right: $\gamma_{AA} = 1/3$, $\gamma_{AB} = 1/5$, $\pi_{AA} = 1/3$, $\alpha = 10\%$.
inconsistency is the fact that if two vectors are equal,
any function of them must maintain the equality. The
fact that incoherence remains even when using an exact approach
hints that the problem is the change of dimension
when going from global homogeneity to partial homogeneity:
genotypic homogeneity is in dimension 2 (two degrees
of freedom) and allelic homogeneity is in dimension 1
(one degree of freedom). As Wald tests, likelihood ratio
tests and chi-squared tests are asymptotically equivalent,
it is also expected that contradictions may happen with all
of them.

Similar incoherences of p-values in other situations
have already been reported in the literature. As a simple
ANOVA-like example, suppose we wish to compare
the means of independent random variables from 3 different
groups, $\mu_1$, $\mu_2$ and $\mu_3$. If we assume their distribution
is normal with variance 1 and the sample means in each
group (sufficient statistics) are -0.192, 0.015 and 0.017,
the likelihood ratio p-value for the hypothesis $\mu_1 = \mu_2$ is
0.037. On the other hand, when testing $\mu_1 = \mu_2 = \mu_3$
we get a p-value of 0.054. Hence, at the level of 5%, the
first hypothesis is rejected, but the second one is not.
This makes it debatable whether it is reasonable to use p-values
as measures of evidence [37]. On the other hand, if we
use the improper prior $f(\mu_1, \mu_2, \mu_3) \propto 1$, the e-values
are 0.232 and 0.121, respectively. Hence the contradiction
cannot happen for any cutoff.
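The ANOVA-like example can be reproduced approximately with a few lines of code. The sketch below rests on two assumptions that are not stated in the text: a common group size of n = 200, and the fact that, for a normal model with known variance and a flat prior, the e-value of a linear hypothesis equals the upper tail of a chi-squared distribution with 3 degrees of freedom (the dimension of the full parameter space) evaluated at the likelihood ratio statistic. Because the group size is assumed, the resulting values are close to, but need not coincide with, those quoted above.

```python
import math


def chi2_sf(x, df):
    """Upper tail of the chi-squared distribution (closed forms for df = 1, 2, 3)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("only df = 1, 2, 3 are implemented")


means = [-0.192, 0.015, 0.017]  # sample means given in the text
n = 200  # ASSUMPTION: group size, not stated in the text

# Likelihood ratio statistics (variance known and equal to 1).
stat_12 = n * (means[0] - means[1]) ** 2 / 2          # H: mu_1 = mu_2
grand_mean = sum(means) / 3
stat_123 = n * sum((m - grand_mean) ** 2 for m in means)  # H: mu_1 = mu_2 = mu_3

# p-values use the degrees of freedom of each hypothesis ...
p_12 = chi2_sf(stat_12, 1)
p_123 = chi2_sf(stat_123, 2)

# ... whereas e-values are always computed in the full 3-dimensional space.
e_12 = chi2_sf(stat_12, 3)
e_123 = chi2_sf(stat_123, 3)
```

Under these assumptions the p-value of the larger hypothesis falls below 5% while that of the nested one stays above it, whereas the e-values are ordered coherently because both are the same decreasing function of statistics that satisfy `stat_12 <= stat_123`.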

As probabilities are monotonic, traditional Bayesian
tests based on posterior probability calculations do enjoy
the monotonicity property; however, using them here may be
problematic because the hypotheses of interest are sharp.
Mixed continuous-discrete distributions are needed in
this case. Bayes factors, on the other hand, were shown
not to be monotonic. This does not invalidate their use: in fact,
as pointed out by [38] and [42], Bayes factors provide a
great tool for model selection. One of the reasons for this
is that parsimonious models can have better predictive
power than complex models [46].

The FBST computation is always performed in the full
space, which has dimension 4. Hence subhypotheses should
coherently follow the orientation of the main hypothesis.
Moreover, there is no need to specify special priors for
each of the null hypotheses, only for the whole parameter
space $\Theta$. It can also be easily implemented. The problem
with the FBST is that the values of the significance index,
"e", are related to the dimension and increase as the dimension
increases. However, in [47] it is shown how "e" relates
to "p". This allows one to look for the e-value that corresponds
to a 5% level of significance, for instance. Another point
in favor of the FBST is that its power is almost the same
as that of the best frequentist test. Moreover, it is correct even
when HWE does not hold. It is important to remember
that e-values are probabilities of subsets of the parameter
space, whereas p-values are probabilities of sets (tails) of
the sample space. On the other hand, one must understand
that hypotheses are statements about points of the
parameter space and not of the sample space. Might this
explain why the e-values, contrary to p-values,
are coherent in all situations?
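To make the dependence of "e" on the dimension concrete, the sketch below converts a nominal level of significance into an e-value cutoff. It assumes a large-sample approximation in which the p-value is the upper tail of a chi-squared distribution with as many degrees of freedom as the test, and the e-value is the upper tail of a chi-squared distribution with as many degrees of freedom as the full parameter space (here 4), both evaluated at the same statistic. This approximation is an assumption of the example and may differ in detail from the relationship derived in [47]; the function names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper tail of the chi-squared distribution (closed forms for df = 1, 2, 4)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 4:
        return math.exp(-x / 2) * (1 + x / 2)
    raise ValueError("only df = 1, 2, 4 are implemented")


def chi2_upper_quantile(alpha, df, upper=200.0):
    """Value x such that chi2_sf(x, df) = alpha, found by bisection."""
    lower = 0.0
    for _ in range(200):
        mid = (lower + upper) / 2
        if chi2_sf(mid, df) > alpha:
            lower = mid
        else:
            upper = mid
    return (lower + upper) / 2


def e_value_cutoff(alpha, test_df, full_dim=4):
    """E-value cutoff matching a level-alpha test with test_df degrees of freedom."""
    return chi2_sf(chi2_upper_quantile(alpha, test_df), full_dim)


cutoff_allelic = e_value_cutoff(0.05, test_df=1)    # allelic homogeneity
cutoff_genotypic = e_value_cutoff(0.05, test_df=2)  # genotypic homogeneity
```

Under this approximation both cutoffs are well above 5%, which illustrates why e-values cannot be compared directly with the usual thresholds for p-values.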

A routine written in R that performs all the
tests considered in this paper can be downloaded from
www.ime.usp.br/~cpereira/programs/nested.r